Abstract



Optimistic parallelization is a promising approach for the parallelization of irregular
algorithms: potentially interfering tasks are launched dynamically and the runtime system
detects conflicts between concurrent activities, aborting and rolling back conflicting
tasks. However, parallelism in irregular algorithms is very complex. In a regular algorithm
like dense matrix multiplication, the amount of parallelism can usually be expressed as a
function of the problem size, so it is reasonably straightforward to determine how many
processors should be allocated to execute a regular algorithm of a certain size (this is
called the processor allocation problem). In contrast, parallelism in irregular algorithms
can be a function of input parameters, and the amount of parallelism can vary dramatically
during the execution of the irregular algorithm. Therefore, the processor allocation
problem for irregular algorithms is very difficult.

In this paper, we describe the first systematic strategy for addressing this problem.
Our approach is based on a construct called the conflict graph, which (i) provides insight
into the amount of parallelism that can be extracted from an irregular algorithm, and
(ii) can be used to address the processor allocation problem for irregular algorithms. We
show that this problem is related to a generalization of the unfriendly seating problem and,
by extending Turán's theorem, we obtain a worst-case class of problems for optimistic
parallelization, which we use to derive a lower bound on the exploitable parallelism. Finally,
using some theoretically derived properties and some experimental facts, we design
a quick and stable control strategy for solving the processor allocation problem heuristically.
1 Introduction



The advent of on-chip multiprocessors has made parallel programming a mainstream concern.
Unfortunately, writing correct and efficient parallel programs is a challenging task for
the average programmer. Hence, in recent years, many projects [14, 10, 3, 20] have tried
to automate parallel programming for some classes of algorithms. Most of them focus on
regular algorithms such as Fourier transforms [9, 19] and dense linear algebra routines [1].
Automation is more difficult when the algorithms are irregular and use pointer-based data
structures such as graphs and sets. One promising approach is based on the concept of
amorphous data parallelism [17]. Algorithms are formulated as iterative computations on work-sets,
and each iteration is identified as a quantum of work (task) that can potentially be executed
in parallel with other iterations. The Galois project [18] has shown that algorithms
formulated in this way can be parallelized automatically using optimistic parallelization: iterations
are executed speculatively in parallel and, when an iteration conflicts with concurrently
executing iterations, it is rolled back. Algorithms that have been successfully parallelized in this
manner include Survey propagation [5], Boruvka's algorithm [6], Delaunay triangulation and
refinement [12], and Agglomerative clustering [21].

In a regular algorithm like dense matrix multiplication, the amount of parallelism can
usually be expressed as a function of the problem size, so it is reasonably straightforward
to determine how many processors should be allocated to execute a regular algorithm of a
certain size (this is called the processor allocation problem). In contrast, parallelism in
irregular algorithms can be a function of input parameters, and the amount of parallelism can
vary dramatically during the execution of the irregular algorithm [16]. Therefore, the
processor allocation problem for irregular algorithms is very difficult. Optimistic parallelization
complicates this problem even more: if there are too many processors and too little parallel
work, not only might some processors be idle, but speculative conflicts may actually retard
the progress of even those processors that have useful work to do, increasing both program
execution time and power consumption. This paper presents the first systematic approach to
addressing the processor allocation problem for irregular algorithms under optimistic
parallelization, and it makes the following contributions.



• We develop a simple graph-theoretic model for optimistic parallelization and use it
to formulate processor allocation as an optimization problem that balances parallelism
exploitation with minimizing speculative conflicts (Section 2).

• We identify a worst-case class of problems for optimistic parallelization; to this purpose,
we develop an extension of Turán's theorem [2] (Section 3).

• Using these ideas, we develop an adaptive controller that dynamically solves the
processor allocation problem for amorphous data-parallel programs, providing rapid response
to changes in the amount of amorphous data-parallelism (Section 4).



2 Modeling Optimistic Parallelization 

A typical example of an algorithm that exhibits amorphous data-parallelism is Delaunay
mesh refinement, summarized as follows. A triangulation of some planar region is given,
containing some "bad" triangles (according to some quality criterion). To remove them, each
bad triangle is selected (in arbitrary order), and this triangle, together with the triangles
that lie in its cavity, is replaced with new triangles. The retriangulation can produce new
bad triangles, but this process can be proved to halt after a finite number of steps. Two bad
triangles can be processed in parallel, provided that their cavities do not overlap.

1 A brief announcement of this work has been presented at SPAA'11 [23].
Figure 1: Optimistic parallelization. (i) Nodes represent possible computations, edges conflicts
between them. (ii) $m$ nodes are chosen at random and run concurrently. (iii) At runtime the
conflicts are detected, some nodes abort and their execution is rolled back, leaving a maximal
independent set in the subgraph induced by the initial choice of nodes.

There are also algorithms that exhibit amorphous data-parallelism for which the order
of execution of the parallel tasks cannot be arbitrary, but must satisfy some constraints (e.g.,
in discrete event simulations the events must commit chronologically). We will not treat this
class of problems in this work; we focus only on unordered algorithms [16]. A different
context, in which there is no roll-back and tasks do not conflict but obey some precedence
relations, is treated in [1].

Optimistic parallelization deals with amorphous data-parallelism by maintaining a work-set
of the tasks to be executed. At each temporal step some tasks are selected and speculatively
launched in parallel. If, at runtime, two processes modify the same data, a conflict is
detected and one of the two has to abort and roll back its execution. Neglecting the details of
the various amorphous data-parallel algorithms, we can model their common behavior at a
higher level with a simple graph-theoretic model: we can think of a scheduler as working on a
dynamic graph $G_t = (V_t, E_t)$, where the nodes represent computations we want to do, but we
have no initial knowledge of the edges, which represent conflicts between computations (see
Fig. 1). At time step $t$ the system picks uniformly at random $m_t$ nodes (the active nodes) and
tries to process them concurrently. When it processes a node it finds out whether the node is
connected to other executed nodes: if a neighboring node happens to have been processed
before it, the node aborts; otherwise the node is considered processed and is removed from the
graph, and some operations may be performed in its neighborhood, such as adding new nodes with
edges or altering the neighbors. The time taken to process conflicting and non-conflicting
nodes is assumed to be the same, as happens, e.g., for Delaunay mesh refinement.
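
A single temporal step of this model can be sketched in Python as follows. This is only a minimal illustration, not the Galois scheduler: the names `run_round` and `adj` are hypothetical, the CC graph is assumed to be given as a dictionary mapping each node to the set of its neighbors, and the order in which the nodes are drawn is taken as the commit order.

```python
import random

def run_round(adj, m, rng=random):
    """Simulate one temporal step on a CC graph.

    adj: dict mapping each node to the set of nodes it conflicts with
         (assumed representation, not prescribed by the paper).
    m:   number of nodes launched concurrently.
    Returns (committed, aborted); the sampling order is the commit order.
    """
    nodes = list(adj)
    active = rng.sample(nodes, min(m, len(nodes)))  # uniformly random, ordered
    committed, aborted = [], []
    done = set()
    for v in active:
        if adj[v] & done:        # conflicts with an already committed node
            aborted.append(v)
        else:                    # no committed neighbor: v commits
            committed.append(v)
            done.add(v)
    return committed, aborted

# Example: a triangle {0, 1, 2} plus an isolated node 3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: set()}
committed, aborted = run_round(adj, m=4, rng=random.Random(0))
print(len(committed), len(aborted))   # prints: 2 2
```

The committed nodes form a maximal independent set of the subgraph induced by the active nodes, as in Fig. 1(iii). Removing them from the graph and updating their neighborhood is application specific and is left out of the sketch.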

2.1 Control Optimization Goal 

When we run an optimistic parallelization we have two contrasting goals: we want
to maximize the work done, achieving high parallelism, but at the same time we want to
minimize the conflicts, hence obtaining a good use of the processors' time. (Furthermore,
for some algorithms the roll-back work can be quite resource-consuming.) These two goals
are not compatible: if we naively try to minimize the total execution time, the system
is forced to always use all the available processors, whereas if we try to minimize the time
wasted by aborted processes, the system uses only one processor. Therefore, in the following
we choose a trade-off goal and cast it in our graph-theoretic model.

Let $G = (V, E)$ be a computations/conflicts (CC) graph with $n = |V|$ nodes. When
a scheduler chooses, uniformly at random, $m$ nodes to be run, the ordered set $\pi_m(\cdot)$ by
which they commit can be modeled as a random permutation: if $i < j$ then $\pi_m(i)$ commits
before $\pi_m(j)$ (if there is a conflict between $\pi_m(i)$ and $\pi_m(j)$ then $\pi_m(i)$ commits and
$\pi_m(j)$ aborts; if $\pi_m(i)$ aborted due to conflicts with previous processes, $\pi_m(j)$ can commit,
if not conflicting with other committed processes). Let $k_t(\pi_m)$ be the number of aborted processes
due to conflicts and $r_t(\pi_m) \in [0, 1)$ the ratio of conflicting processors observed at time $t$ (i.e.
$r_t(\pi_m) = k_t(\pi_m)/m$). We define the conflict ratio $\bar r_t(m)$ to be the expected $r$ that we obtain
when the system is run with $m$ processors:

\[ \bar r_t(m) = E_{\pi_m}\left[ r_t(\pi_m) \right] , \tag{1} \]

where the expectation is computed uniformly over the possible prefixes of length $m$ of the
permutations of the $n$ nodes. The control problem we want to solve is the following: given $r_\tau$ and
$m_\tau$ for $\tau < t$, choose $m_t = \mu_t$ such that $\bar r_t(\mu_t) \simeq \rho$, where $\rho$ is a suitable parameter.

Remark 1. If we want to dynamically control the number of processors, $\rho$ must be chosen
different from zero, otherwise the system converges to using only one processor, thus not being
able to identify available parallelism. A value of $\rho \in [20\%, 30\%]$ is often reasonable, together
with the constraint $m_t \ge 2$.



3 Exploiting Parallelism 

In this section we study how much parallelism can be extracted from a given CC graph and 
how its sparsity can affect the conflict ratio. To this purpose we obtain a worst case class of 
graphs and use it to analytically derive a lower bound for the exploitable parallelism (i.e., an 
upper bound for the conflict ratio). We make extensive use of finite differences (i.e., discrete 
derivatives), which are defined recursively as follows. Let $f : \mathbb{Z} \to \mathbb{R}$ be a real function
defined on the integers, then the $i$-th (forward) finite difference of $f$ is

\[ \Delta^i_f(k) = \Delta^{i-1}_f(k+1) - \Delta^{i-1}_f(k) , \quad \text{with } \Delta^0_f(k) = f(k) . \tag{2} \]

(In the following we will omit $\Delta$'s superscript when equal to one, i.e., $\Delta = \Delta^1$.)
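
As a small illustration of definition (2), the following Python sketch evaluates the $i$-th forward finite difference of a function defined on the integers. The helper name `finite_difference` is ours and does not appear in the paper.

```python
def finite_difference(f, k, i=1):
    """i-th forward finite difference of f at integer k, following Eq. (2)."""
    if i == 0:
        return f(k)
    return finite_difference(f, k + 1, i - 1) - finite_difference(f, k, i - 1)

square = lambda k: k * k
print(finite_difference(square, 3))      # (3+1)^2 - 3^2 = 7
print(finite_difference(square, 3, 2))   # second difference of k^2 is 2
```

A non-negative first difference corresponds to a non-decreasing function, and a non-negative second difference to a convex one, which is how the two notions are used in the lemmas of this section.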

First, we obtain two basic properties of $\bar r$, which are given by the following propositions.

Proposition 1. The conflict ratio function $\bar r(m)$ is non-decreasing in $m$.

To prove Prop. 1 we first need a lemma:

Lemma 1. Let $\bar k(m) = E_{\pi_m}[k(\pi_m)]$. Then $\bar k$ is a non-decreasing convex function, i.e.
$\Delta_{\bar k}(m) \ge 0$ and $\Delta^2_{\bar k}(m) \ge 0$.

Proof. Let $\bar k(\pi_m, i)$ be the expected number of conflicting nodes running $r = m + i$ nodes
concurrently, the first $m$ of which are $\pi_m$ and the last $i$ are chosen uniformly at random
among the remaining ones. By definition, we have

\[ E_{\pi_m}\left[ \bar k(\pi_m, i) \right] = \bar k(m + i) . \tag{3} \]

In particular,

\[ \bar k(\pi_m, 1) = k(\pi_m) + \Pr[(m+1)\text{-th conflicts}] , \tag{4} \]

which brings

\[ \bar k(m + 1) = E_{\pi_m}\left[ \bar k(\pi_m, 1) \right] = \bar k(m) + \eta , \tag{5} \]

with $\eta = \bar k(m+1) - \bar k(m) = \Delta_{\bar k}(m) \ge 0$, hence proving the monotonicity of $\bar k$. Consider now

\[ \bar k(\pi_m, 2) = k(\pi_m) + \Pr[(m+1)\text{-th conflicts}] + \Pr[(m+2)\text{-th conflicts}] . \tag{6} \]

If the $(m+1)$-th node does not add any edge, then we have

\[ \Pr[(m+1)\text{-th conflicts}] = \Pr[(m+2)\text{-th conflicts}] , \tag{7} \]

but since it may add some edges, the probability of conflicting the second time is in general
larger and thus $\Delta^2_{\bar k}(m) \ge 0$. □
Proof of Prop. 1. Since $\bar r(m) = \bar k(m)/m$, its finite difference can be written as

\[ \Delta_{\bar r}(m) = \frac{m \Delta_{\bar k}(m) - \bar k(m)}{m(m+1)} . \tag{8} \]

Because of Lemma 1 and since $\bar k(1) = 0$, we have

\[ \bar k(m + 1) \le m \Delta_{\bar k}(m) , \tag{9} \]

which finally brings

\[ \Delta_{\bar r}(m) = \frac{m \Delta_{\bar k}(m) - \bar k(m)}{m(m+1)} \ge \frac{\bar k(m+1) - \bar k(m)}{m(m+1)} = \frac{\Delta_{\bar k}(m)}{m(m+1)} \ge 0 . \tag{10} \]

□

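Proposition 2. Let $G$ be a CC graph with $n$ nodes and average degree $d$. Then the initial derivative of
$\bar r$ depends only on $n$ and $d$ as

\[ \Delta_{\bar r}(1) = \frac{d}{2(n-1)} . \tag{11} \]

Proof. Since

\[ \Delta_{\bar r}(1) = \frac{\bar k(2)}{2} - \frac{\bar k(1)}{1} = \frac{\bar k(2)}{2} , \tag{12} \]

we just need to obtain $\bar k(2)$. Let $\bar k$ be defined as in the proof of Lemma 1 and $\pi_1 = v$ a node
chosen uniformly at random. Then

\[ \bar k(2) = E_v\left[ \bar k(v, 1) \right] = E_v\left[ \frac{d_v}{n-1} \right] = \frac{E_v[d_v]}{n-1} = \frac{d}{n-1} . \tag{13} \]

□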
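
Proposition 2 can be checked by brute force on a small graph. The sketch below is only an illustration: the helper `conflict_ratio_m2` and the adjacency-dictionary representation are our own assumptions. It enumerates all ordered pairs of nodes, counts the pairs in which the second node aborts, and compares the resulting $\bar r(2)$ (which equals $\Delta_{\bar r}(1)$, since $\bar r(1) = 0$) with $d/(2(n-1))$.

```python
from fractions import Fraction
from itertools import permutations

def conflict_ratio_m2(adj):
    """Exact conflict ratio for m = 2: average of k(pi_2)/2 over ordered pairs."""
    pairs = list(permutations(adj, 2))
    aborts = sum(1 for u, v in pairs if v in adj[u])   # second node aborts
    return Fraction(aborts, len(pairs)) / 2

# Path graph 0 - 1 - 2 - 3: n = 4 nodes, 3 edges, average degree d = 3/2.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
n = len(adj)
d = Fraction(sum(len(nb) for nb in adj.values()), n)
print(conflict_ratio_m2(adj), d / (2 * (n - 1)))   # both equal 1/4
```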



A measure of the available parallelism for a given CC graph has been identified in [15] by
considering, at each temporal step, a maximal independent set of the CC graph. The expected
size of a maximal independent set gives a reasonable and computable estimate of the available 
parallelism. However, this is not enough to predict the actual amount of parallelism that a 
scheduler can exploit while keeping a low conflict ratio, as shown in the following example. 

Example 1. Let $G = K_{n^2} \cup D_n$, where $K_{n^2}$ is the complete graph of size $n^2$ and $D_n$ a disconnected
graph of size $n$ (i.e. $G$ is made up of a clique of size $n^2$ and $n$ disconnected nodes). For this graph every
maximal independent set is also maximum and has size $n + 1$, but if we choose $n + 1$ nodes uniformly
at random and then compute the conflicts we obtain that, on average, there are only 2 independent
nodes.
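
The value 2 in Example 1 can be verified directly; the short derivation below is ours. The graph has $N = n^2 + n = n(n+1)$ nodes, and $m = n + 1$ of them are chosen uniformly at random. Each of the $n$ disconnected nodes is chosen with probability $m/N$, and every chosen disconnected node commits. Since only $n$ nodes lie outside the clique, at least one of the $n + 1$ chosen nodes belongs to the clique, and exactly one clique node commits. Hence

\[ E[\#\,\text{independent nodes}] = n \cdot \frac{n+1}{n(n+1)} + 1 = 2 . \]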

A more realistic estimate of the performance of a scheduler can be obtained by analyzing
the sparsity of the CC graph. The average degree of the CC graph is linked to the expected size of
a maximal independent set of the graph by the following well-known theorem (in the variant
shown in [2] or [22]):

Theorem 1. (Turán, strong formulation). Let $G = (V, E)$ be a graph, $n = |V|$ and let $d$ be the
average degree of $G$. Then the expected size of a maximal independent set, obtained choosing greedily
the nodes from a random permutation, is at least $s = n/(d + 1)$.

Remark 2. The previous bound is existentially tight: let $K_d^n$ be the graph made up of $s =
n/(d + 1)$ cliques of size $d + 1$, then the average degree is $d$ and the size of every maximal
(and maximum) independent set is exactly $s$. Furthermore, every other graph with the same
number of nodes and edges has a bigger average maximal independent set.
The study of the expected size of a maximal independent set in a given graph is also
known as the unfriendly seating problem [7, 13] and is particularly relevant in statistical physics,
where it is usually studied on mesh-like graphs [11]. The properties of the graph $K_d^n$ have
suggested to us the formulation of an extension of Turán's theorem. We prove that the
graphs $K_d^n$ provide a worst case (for a given degree $d$) for the generalization of this problem
obtained by focusing on maximal independent sets of induced subgraphs. This allows, when
given a target conflict ratio $\rho$, the computation of a lower bound for the parallelism a scheduler
can exploit.

Theorem 2. Let $G$ be a graph with the same number of nodes and the same degree as $K_d^n$, and let
$EM_m(G)$ be the expected size of a maximal independent set of the subgraph induced by a uniformly
random choice of $m$ nodes in $G$. Then

\[ EM_m(G) \ge EM_m(K_d^n) . \tag{14} \]

To prove it we first need the following lemma.

Lemma 2. The function $\eta_j(x) = \prod_{i=1}^{j} (n - i - x)$ is convex for $x \in [0, n - j]$.

Proof. We prove by induction on $j$ that, for $x \in [0, n - j]$,

\[ \eta_j(x) \ge 0 , \quad \eta'_j(x) \le 0 , \quad \eta''_j(x) \ge 0 . \tag{15} \]

Base case. Let $\eta_0(x) = 1$. The properties above are easily verified.

Induction. Since $\eta_j(x) = \eta_{j-1}(x)\,(n - j - x)$, we obtain

\[ \eta'_j(x) = -\eta_{j-1}(x) + (n - j - x)\,\eta'_{j-1}(x) , \tag{16} \]

which is non-positive by inductive hypotheses. Similarly,

\[ \eta''_j(x) = -2\eta'_{j-1}(x) + (n - j - x)\,\eta''_{j-1}(x) \tag{17} \]

is non-negative. □

Proof of Thm. 2. Consider a random permutation $\pi$ of the nodes of a generic graph $G$ that has the same
number of nodes and edges as $K_d^n$. We assume the prefix of length $m$ of $\pi$ (i.e. $\pi(1), \ldots, \pi(m)$)
forms the active nodes and focus on the following independent set $IS_m$ in the induced
subgraph: a node $v$ is in $IS_m(G, \pi)$ if and only if it is in the first $m$ positions of $\pi$ and it has no
neighbors preceding it. Let $b_m(G)$ be the expected size of $IS_m(G, \pi)$ averaged over all possible
$\pi$'s (chosen uniformly):

\[ b_m(G) = E_\pi\left[ \# IS_m(G, \pi) \right] . \tag{18} \]

Since by construction $b_m(G) \le EM_m(G)$ whereas $b_m(K_d^n) = EM_m(K_d^n)$, we just need to prove
that $b_m(K_d^n) \le b_m(G)$. Given a generic node $v$ of degree $d_v$ and a random permutation $\pi$, its
probability to be in $IS_m(G, \pi)$ is

\[ \Pr[v \in IS_m(G, \pi)] = \frac{1}{n} \sum_{j=1}^{m} \prod_{i=1}^{j-1} \frac{n - i - d_v}{n - i} . \tag{19} \]

By the linearity of the expectation we can write $b$ as

\[ b_m(G) = \sum_{v \in V} \frac{1}{n} \sum_{j=1}^{m} \prod_{i=1}^{j-1} \frac{n - i - d_v}{n - i} \tag{20} \]

\[ \phantom{b_m(G)} = \sum_{j=1}^{m} \left( \prod_{i=1}^{j-1} \frac{1}{n - i} \right) E_v\left[ \prod_{i=1}^{j-1} (n - i - d_v) \right] . \tag{21} \]
To prove that $EM_m(G) \ge EM_m(K_d^n)$ it is thus enough to show that

\[ \forall j \quad E_v\left[ \prod_{i=1}^{j} (n - i - d_v) \right] \ge \prod_{i=1}^{j} \left( n - i - E_v[d_v] \right) , \tag{22} \]

which can be done applying Jensen's inequality [13], since in Lemma 2 we have proved the
convexity of $\eta_j(x) = \prod_{i=1}^{j} (n - i - x)$. □
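
The inequality $b_m(K_d^n) \le b_m(G)$ used in the proof can be checked numerically on a small instance. The following sketch is only an illustration under our own naming: `b_m` takes a degree sequence, which is all that Eqs. (19) and (20) depend on. It uses exact rational arithmetic and compares two triangles ($K_2^6$) with a 4-clique plus two isolated nodes, which has the same number of nodes and edges.

```python
from fractions import Fraction

def b_m(degrees, m):
    """Expected size of IS_m from Eqs. (19)-(20), given the node degrees."""
    n = len(degrees)
    total = Fraction(0)
    for d_v in degrees:
        prod = Fraction(1)                 # empty product (j = 1)
        for j in range(1, m + 1):
            total += prod / n              # term for position j
            if j < m:                      # extend the product to i = j
                prod *= Fraction(n - j - d_v, n - j)
    return total

cliques = [2] * 6              # two triangles: n = 6, d = 2
other = [3, 3, 3, 3, 0, 0]     # K_4 plus two isolated nodes: also 6 edges
for m in range(1, 7):
    assert b_m(cliques, m) <= b_m(other, m)
print(b_m(cliques, 3), b_m(other, 3))      # 19/10 and 2
```

For $m = 2$ the two values coincide, in agreement with Prop. 2, and the gap opens as $m$ grows.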

Corollary 1. The worst case for a scheduler among the graphs with the same number of nodes and
edges is obtained for the graph $K_d^n$ (for which we can analytically approximate the performance, as
shown in § 3.1).



Proof. Since

\[ \bar r(m) = \frac{m - EM_m(G)}{m} = 1 - \frac{1}{m} EM_m(G) , \tag{23} \]

the thesis follows. □



3.1 Analysis of the Worst-Case Performance 

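Theorem 3. Let $d$ be the average degree of $G = (V, E)$ with $n = |V|$ (for simplicity we assume
$n/(d + 1) \in \mathbb{N}$). The conflict ratio is bounded from above as

\[ \bar r(m) \le 1 - \frac{n}{m(d+1)} \left[ 1 - \prod_{i=1}^{m} \frac{n - d - i}{n + 1 - i} \right] . \tag{24} \]

Proof. Let $s = n/(d + 1)$ be the number of connected components in $K_d^n$. Because of Thm. 2
and Cor. 1 it suffices to show that

\[ EM_m(K_d^n) = s \left[ 1 - \prod_{i=1}^{m} \frac{n - d - i}{n + 1 - i} \right] . \tag{25} \]

The probability for a connected component $k$ of $K_d^n$ not to be accessed when $m$ nodes are
chosen is given by the following hypergeometric

\[ \Pr[k \text{ not hit}] = \frac{\binom{n-d-1}{m} \binom{d+1}{0}}{\binom{n}{m}} = \prod_{i=1}^{m} \frac{n - d - i}{n + 1 - i} . \tag{26} \]

Let $X_k$ be a random variable that is 1 when component $k$ is hit and 0 otherwise. We have
that $E[X_k] = 1 - \prod_{i=1}^{m} \frac{n-d-i}{n+1-i}$ and, by the linearity of the expectation, the average number of
components accessed is

\[ E\left[ \sum_{k=1}^{s} X_k \right] = \sum_{k=1}^{s} E[X_k] = s \left[ 1 - \prod_{i=1}^{m} \frac{n - d - i}{n + 1 - i} \right] . \tag{27} \]

□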
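
To make the bound concrete, the following sketch evaluates the right-hand side of Eq. (24) with exact rational arithmetic. The function name `worst_case_bound` is hypothetical; the formula is the one stated in Theorem 3, and the parameters in the example ($n = 12$, $d = 2$) are chosen only so that $n/(d+1)$ is an integer.

```python
from fractions import Fraction

def worst_case_bound(n, d, m):
    """Right-hand side of Eq. (24): upper bound on the conflict ratio."""
    not_hit = Fraction(1)                       # Pr[a given clique is not hit]
    for i in range(1, m + 1):
        not_hit *= Fraction(n - d - i, n + 1 - i)
    return 1 - Fraction(n, m * (d + 1)) * (1 - not_hit)

print(worst_case_bound(12, 2, 1))   # 0: a single task never conflicts
print(worst_case_bound(12, 2, 4))   # 14/55, about 0.25
```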



Corollary 2. When $n$ and $m$ increase the bound is well approximated by

\[ \bar r(m) \le 1 - \frac{n}{m(d+1)} \left[ 1 - \left( 1 - \frac{m}{n} \right)^{d+1} \right] . \tag{28} \]
Proof. Stirling's approximation for the binomial coefficients, followed by deletion of the
low-order terms in the resulting formula. □

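Corollary 3. If we set $m = \alpha s = \frac{\alpha n}{d+1}$ we obtain

\[ \bar r(m) \le 1 - \frac{1}{\alpha} \left[ 1 - \left( 1 - \frac{\alpha}{d+1} \right)^{d+1} \right] \le 1 - \frac{1}{\alpha} \left[ 1 - e^{-\alpha} \right] . \tag{29} \]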



4 Controlling Processors Allocation 

In this section we will design an efficient control heuristic that dynamically chooses the
number of processes to be run by a scheduler, in order to obtain high parallelism while keeping
the conflict ratio low. In the following we suppose that the properties of $G_t$ vary slowly
compared to the convergence of $m_t$ toward $\mu_t$ under the algorithm we will develop (see § 4.1),
so we can consider $G_t = G$ and $\mu_t = \mu$, and thus our goal is making $m_t$ converge to $\mu$.

Since the conflict ratio is a non-decreasing function of the number of launched tasks $m$
(Prop. 1), we could find $m \simeq \mu$ by bisection simply noticing that

\[ \bar r(m') \le \rho \le \bar r(m'') \;\Rightarrow\; m' \le \mu \le m'' . \tag{30} \]
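
A bisection search based on (30) could look as follows. This is a sketch under the assumption that $\bar r$ can be evaluated (or estimated) for any $m$; the function `find_mu` and the toy conflict-ratio curve are hypothetical and serve only to exercise the search.

```python
def find_mu(r_bar, rho, lo, hi):
    """Largest m in [lo, hi] with r_bar(m) <= rho, for non-decreasing r_bar.

    Returns lo if even r_bar(lo) exceeds rho.
    """
    while lo < hi:
        mid = (lo + hi + 1) // 2      # upper midpoint, so the interval always shrinks
        if r_bar(mid) <= rho:
            lo = mid                  # mu is at least mid
        else:
            hi = mid - 1              # mu is below mid
    return lo

# Toy non-decreasing curve (an assumption, not measured data).
toy = lambda m: (m - 1) / (m + 99)
print(find_mu(toy, 0.25, 1, 1000))    # 34
```

In a running system each evaluation of $\bar r$ costs at least one temporal step and returns a noisy sample, which is one reason to prefer a recurrence that updates $m$ from the observed conflict ratio.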

The control we propose is slightly more complex and is based on recurrence relations, i.e.,
we compute $m_{t+1}$ as a function $F$ of the target conflict ratio $\rho$ and of the parameters which
characterize the system at the previous timestep:

\[ m_{t+1} = F(\rho, r_t, m_t) . \tag{31} \]

The initial value $m_0$ for a recurrence can be chosen to be 2 but, if we have an estimate of the
average degree $d$ of the CC graph, we can choose a smarter value: in fact, applying Cor. 3, we are
sure that using, e.g., $m = \frac{n}{2(d+1)}$ processors we will have at most a conflict ratio of 21.3%.
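
The figure of 21.3% follows from Cor. 3 with $\alpha = 1/2$; the arithmetic, spelled out here for convenience, is

\[ 1 - \frac{1}{\alpha}\left( 1 - e^{-\alpha} \right) \Big|_{\alpha = 1/2} = 1 - 2\left( 1 - e^{-1/2} \right) \approx 1 - 2 \cdot 0.3935 = 0.213 . \]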

Our control heuristic (Algorithm 1) is a hybridization of two simple recurrences. The first
recurrence is quite natural and updates $m$ based on the distance between $r$ and $\rho$:

Recurrence A: \[ m_{t+1} = (1 - r_t + \rho)\, m_t . \tag{32} \]

The second recurrence exploits some experimental facts. In Fig. 2 we have plotted the conflict
ratio functions for three CC graphs with the same size and average degree (note that the initial
derivative is the same for all the graphs, in accordance with Prop. 2). We see that conflict
ratios which reach a high value ($\bar r(n) > \frac{1}{2}$) are initially well approximated by a straight line
(for $m$ such that $\bar r(m) \le \rho = 20$–$30\%$), whereas functions that deviate from this behavior do
not rise too much. This suggests assuming an initial linearity in controlling $m_t$, as done
by the following recurrence:

Recurrence B: \[ m_{t+1} = \frac{\rho}{r_t}\, m_t . \tag{33} \]

The two recurrences can be roughly compared as follows (see Fig. 3): Recurrence A has
a slower convergence than Recurrence B, but it is less susceptible to noise (the variance that
makes the realizations $r_t$ differ from $\bar r_t$). This is the reason why we chose to merge
them in a hybrid algorithm: initially, when the difference between $r$ and $\rho$ is big, we use
Recurrence B to exploit its quick convergence, and then Recurrence A is adopted for a finer
tuning of the control.
Algorithm 1: Pseudo-code of the proposed hybrid control algorithm

// Tunable parameters
m_0 = 2; m_max = 1024; m_min = 2;
T = 4; r_min = 3%; α_0 = 25%; α_1 = 6%;
// Variables
m ← m_0; r ← 0; t ← 0;
// Main loop
while nodes to elaborate ≠ 0 do
    t ← t + 1;
    if m > m_max then m ← m_max;
    else if m < m_min then m ← m_min;
    Launch the scheduler with m nodes;
    r ← r + new conflict ratio;
    if (t mod T) = T − 1 then
        r ← r/T;
        α ← |1 − r/ρ|;
        if α > α_0 then
            if r < r_min then r ← r_min;
            m ← ⌈(ρ/r) m⌉;
        else if α > α_1 then
            m ← ⌈(1 − r + ρ) m⌉;
        r ← 0;
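
A runnable sketch of Algorithm 1 in Python is given below. It rests on several assumptions of ours: the scheduler is replaced by a callback `observe(m)` that returns the conflict ratio observed with `m` tasks; the synthetic curve `model_ratio` is the right-hand side of Eq. (29) with $\alpha = m(d+1)/n$ and stands in for a real workload; the quantity compared with the thresholds is taken to be $|1 - r/\rho|$; and the update is performed once every `T` rounds, so that each average covers exactly `T` samples.

```python
import math

def model_ratio(m, n=2000, d=16):
    """Synthetic conflict ratio: right-hand side of Eq. (29) with alpha = m(d+1)/n."""
    alpha = m * (d + 1) / n
    return 1 - (1 - math.exp(-alpha)) / alpha

def hybrid_controller(observe, rho, steps, m0=2, m_min=2, m_max=1024,
                      T=4, r_min=0.03, alpha0=0.25, alpha1=0.06):
    """Return the list of m used at each step; observe(m) yields a conflict ratio."""
    m, acc, history = m0, 0.0, []
    for t in range(1, steps + 1):
        m = max(m_min, min(m, m_max))        # clamp to the allowed range
        history.append(m)
        acc += observe(m)                    # "launch the scheduler with m nodes"
        if t % T == 0:                       # update once every T rounds
            r = acc / T                      # averaged conflict ratio
            alpha = abs(1 - r / rho)         # relative distance from the target
            if alpha > alpha0:               # far from the target: Recurrence B
                m = math.ceil(rho / max(r, r_min) * m)
            elif alpha > alpha1:             # closer to the target: Recurrence A
                m = math.ceil((1 - r + rho) * m)
            acc = 0.0
    return history

history = hybrid_controller(model_ratio, rho=0.2, steps=40)
print(history[::4])   # value of m at the start of each averaging window
```

With a noiseless `observe` the controller stops changing $m$ once the averaged ratio falls inside the dead band set by `alpha1`; with a real scheduler the same band is what suppresses small oscillations around the steady state.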




[Plot of $\bar r(m)$ versus $m$, for $m$ from 1 to 2000. Curves: upper bound, random graph, cliques + disconnected nodes, common tangent.]

Figure 2: A plot of $\bar r(m)$ for some graphs with $n = 2000$ and $d = 16$: (i) the worst-case upper
bound of Cor. 2; (ii) a random graph (edges chosen uniformly at random until the desired degree is
reached; data obtained by computer simulation); (iii) a graph that is a union of cliques and
disconnected nodes.
Figure 3: Comparison between two realizations of the hybrid algorithm and one that only uses
Recurrence A, for two different random graphs ($n = 2000$ in both cases). The hybrid version has
different parameters for $m$ greater or smaller than 20. $\rho$ was chosen to be 20%. The proposed
algorithm proves to be both quick in convergence and stable.



4.1 Experimental Evaluation 

In the practical implementation of the control algorithm we have made the following opti- 
mizations: 

• Since $r_t$ can have a large variance, especially when $m$ is small, we apply the
changes to $m$ every $T$ steps, using the values averaged over these intervals, to
smooth the oscillations.

• To further reduce the oscillations we apply a change only if the observed $r_t$ is sufficiently
different from $\rho$ (e.g. by more than 6%), thus avoiding small variations in the steady state,
which interfere with locality exploitation because of the data moving from one processor
to another.

• Another problem that must be considered is that for small values of $m$ the variance is
much bigger, so it is better to tune this case separately using different parameters (this
optimization is not shown in the pseudo-code).

To validate our controller we have run the following simulation: a random CC graph of
fixed average degree $d$ is taken and the controller runs on it, starting with $m_0 = 2$. We are
interested in seeing how many temporal steps it takes to converge to $m_t \simeq \mu$. As can be
seen in [15], the parallelism profile of many practical applications can vary quite abruptly,
e.g., Delaunay mesh refinement can go from no parallelism to one thousand possible parallel
tasks in just 30 temporal steps. Therefore, an algorithm that wants to efficiently control the
processor allocation for these problems must adapt very quickly to changes in the available
parallelism. Our controller, which uses the very fast Recurrence B in the initial phase, proves
to be fast enough: as shown in Fig. 3, in about 15 steps the controller converges close to
the desired value of $\mu$.



5 Conclusions and Future Work 

Automatic parallelization of irregular algorithms is a rich and complex subject and will offer
many difficult challenges to researchers in the near future. In this paper we have focused
on the processor allocation problem for unordered amorphous data-parallel algorithms; it would
be extremely valuable to obtain similar results for the more general and difficult case of
ordered algorithms (e.g., discrete event simulation). In particular, it is very hard to obtain good
estimates of the available parallelism for such algorithms, given the complex dependencies
arising between the concurrent tasks. Another aspect which needs investigation, especially in
the ordered context, is whether some statistical properties of the behavior of irregular algorithms
can be modeled, extracted and exploited to build better controllers, able to dynamically adapt
to the different execution phases.

As for a real-world implementation, the proposed control heuristic is now being integrated 
in the Galois system and it will be evaluated on more realistic workloads. 